1. Installation

Install all packages in the latest version of R.

devtools::install_github("zhouzilu/DENDRO")

In addition to DENDRO package, we also needed to download the reference RNA-seq dataset.

2. Questions & issues

If you have any questions or problems when using DENDROplan, please feel free to open a new issue here. You can also email the maintainers of the package – the contact information is below.

3. DENDROplan pipeline

3.1 Overall pipeline

Figure 2 illustrate the overall pipeline. Figure 2. A flowchart outlining the procedures for DENDROplan.

3.2 Load reference

Our reference dataset is from Deng et al. with great sequencing properties. First, let’s load DENDROplan_ref and check the structure of the variable ref

library(DENDRO)
data("DENDROplan_ref")
str(ref)
#> List of 3
#>  $ X1       : int [1:22958, 1:133] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:133] "GSM1112611_earlyblast_2-1_expression.txt" "GSM1112612_earlyblast_2-10_expression.txt" "GSM1112613_earlyblast_2-12_expression.txt" "GSM1112614_earlyblast_2-15_expression.txt" ...
#>  $ X2       : int [1:22958, 1:133] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:133] "GSM1112611_earlyblast_2-1_expression.txt" "GSM1112612_earlyblast_2-10_expression.txt" "GSM1112613_earlyblast_2-12_expression.txt" "GSM1112614_earlyblast_2-15_expression.txt" ...
#>  $ exp_level: chr [1:22958] "low" "low" "low" "low" ...

3.3 Tree simulation

3.3.1 Tree type and parameters

The trees that DENDROplan is able to simulate show at Figure 3. The corresponding k and subtype value are also included.

Numbers on branches illustrate index of mutation probability (variable lprob) and numbers at node illustrate index of cell proportion (variable kprob). The interested clade is labeled by a star.

Figure 3. Example tree structure DENDRO can simulate with corresponding parameters.

3.3.2 Accuracy measurements

We measure DENDROplan results by four statistics, explained here:

  • Adjust Rand index: Adjusted Rand index is a measure of the similarity between two data clusterings after adjusted for the chance grouping of elements. For details, see here

  • Capture rate: Capture rate is a measure of “false negative rate” of a specific clade. Out of all the cells from the specific clade, how many of them is detected by the algorithm.

  • Purity: Purity is a measure of “false positive rate” of a specific clade. Out of all the cells in the “specific cluster” you detected, how many are actually from the true specific clade.

  • Observation probability: This is only one single value, which measures the probability of observe all clades in our multipe simulation round.

3.4 Practice

3.4.1 Exampe I

Assume we only have 100 mutation sites and our tree looks like Figure 3c with mutations and cells uniformly distributed

res=DENDRO.simulation(n=100,ref=ref,k=4,subtype=1)
#> There are total  4  clades in our simulation

res
#>        AdjustRandIndex Capture rate    Purity Observation probability
#> CI_low       0.6300837    0.7692308 0.6923077                      NA
#> Mean         0.8517463    0.9454780 0.9534285                       1
#> CI_up        1.0000000    1.0000000 1.0000000                      NA

Result shows the example tree structure as well as statistics with 95% Confidence Interval (CI).

3.4.2 Example II

In the 2nd example, we want to specify the similar tree but with customized cell proportion. We want the four clusters have cell proportion 0.2, 0.2, 0.2 and 0.4, i.e. we want one major subclone. In our code we need to specify this by parameter kprob.

res=DENDRO.simulation(kprob=c(0.2,0.2,0.2,0.4),n=100,ref=ref,k=4,subtype=1)
#> There are total  4  clades in our simulation

res
#>        AdjustRandIndex Capture rate    Purity Observation probability
#> CI_low       0.5999641    0.7027027 0.8437500                      NA
#> Mean         0.8631731    0.9402432 0.9727299                       1
#> CI_up        1.0000000    1.0000000 1.0000000                      NA

Example tree shows consistency with our input.

3.4.3 Example III

One important reason for user to use this tool is that it can estimate the cluster accuracy given different sequencing depth. In the original Deng et al. paper, they collect around 10,000,000 ~50bp reads per cell in mice, resulting 46 mapped reads for each mutation site, which is super high depth (Figure 4).

Figure 4. scRNA-seq library statistics for Deng et al. Modified from original paper.

In real life, usually 1,000,000 reads per cell is pretty good depth. Proportionally, there are around 4.5 mapped reads per mutation site. Let’s change the parameter RD and see how well DENDRO performs.

res=DENDRO.simulation(RD=4.5,n=100,ref=ref,k=4,subtype=1)
#> There are total  4  clades in our simulation

res
#>        AdjustRandIndex Capture rate    Purity Observation probability
#> CI_low       0.5600606    0.7241379 0.7307692                      NA
#> Mean         0.8192305    0.9327139 0.9453612                       1
#> CI_up        0.9744867    1.0000000 1.0000000                      NA

Feel free to play with different parameters. The detailed function document can be found by typing ??DENDRO.simulation in R

4. Session info

sessionInfo()
#> R version 3.4.1 (2017-06-30)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 10240)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] DENDRO_0.1.0     ggplot2_2.2.1    mclust_5.4       phyclust_0.1-22 
#>  [5] colorspace_1.3-2 svMisc_1.1.0     phangorn_2.4.0   ape_5.0         
#>  [9] dendextend_1.8.0 TailRank_3.2.1   oompaBase_3.2.6 
#> 
#> loaded via a namespace (and not attached):
#>  [1] modeltools_0.2-21   kernlab_0.9-25      lattice_0.20-35    
#>  [4] oompaData_3.1.1     htmltools_0.3.6     stats4_3.4.1       
#>  [7] viridisLite_0.3.0   yaml_2.1.16         rlang_0.2.2        
#> [10] prabclus_2.2-6      BiocGenerics_0.24.0 fpc_2.1-11         
#> [13] plyr_1.8.4          robustbase_0.92-8   stringr_1.3.1      
#> [16] munsell_0.4.3       gtable_0.2.0        mvtnorm_1.0-7      
#> [19] evaluate_0.10.1     labeling_0.3        Biobase_2.38.0     
#> [22] knitr_1.17          flexmix_2.3-14      parallel_3.4.1     
#> [25] class_7.3-14        DEoptimR_1.0-8      trimcluster_0.1-2  
#> [28] Rcpp_0.12.18        scales_0.5.0        backports_1.1.0    
#> [31] diptest_0.75-7      gridExtra_2.3       fastmatch_1.1-0    
#> [34] digest_0.6.12       stringi_1.1.7       grid_3.4.1         
#> [37] rprojroot_1.2       quadprog_1.5-5      tools_3.4.1        
#> [40] magrittr_1.5        lazyeval_0.2.1      tibble_1.3.4       
#> [43] cluster_2.0.6       whisker_0.3-2       pkgconfig_2.0.1    
#> [46] MASS_7.3-47         Matrix_1.2-14       rmarkdown_1.8      
#> [49] viridis_0.5.0       nnet_7.3-12         igraph_1.1.2       
#> [52] nlme_3.1-131        compiler_3.4.1

5. References